Under construction
In late 2019, a virus first detected in Wuhan, China set in motion a global pandemic that, by the end of 2021, had killed 4.3 million people, infected 238 million others, and disrupted the global economy across virtually every measurable dimension (WHO 2021). According to seven economic impact models constructed by McKibbin and Fernando (2020), estimates of the total global economic loss in terms of GDP run as high as 9.2 trillion U.S. dollars. However, despite a generally positive correlation between GDP and real estate prices (see image 1), historically measured as high as 98%, real estate market prices in the U.S. hit near-record heights following the outbreak of COVID-19 (Anissanti 2021). According to a report released by Zillow Analytics (Manhertz 2021), U.S. real estate gained 2.5 trillion dollars of value in 2020 alone, the largest single-year gain since 2005, despite an approximately 760 billion dollar decrease in GDP in the same year (FRED 2021). The focus of this thesis is to investigate why and how real estate market prices broke trend and behaved so uncharacteristically counter-cyclically in the face of a global pandemic.
In the following sections, I will apply the Hedonic Pricing Method (HPM) to Louisiana housing market data in order to inferentially describe the economic impact of the global pandemic on residential housing market values. Furthermore, I will take advantage of the HPM’s structural framework of decomposing real estate properties into their hedonic features (e.g., size, age, number of bedrooms) to test for changes in demand for specific property features pre- vs. post-pandemic. The HPM will be econometrically modeled using an Ordinary Least Squares (OLS) regression framework for variable-specific analysis, while several variations of machine learning (ML) prediction models will be estimated to test the independent variables’ maximum explanatory power in predicting out-of-sample observations. The results of these models will shed light on the otherwise counterintuitive response of real estate pricing dynamics to the COVID-19 global pandemic.
The market value of a commodity is most often theoretically defined as the equilibrium price derived from the basic economic principle, or law, of supply and demand (Locke and Engels 1691; Epple 1987). However, the real estate market often violates this assumption due to its unique characteristics as an asset class (Wheaton 1999). For example, much of the underlying utility of a property is its use as a means of shelter by its owner (Ling, Ooi, and Le 2015). This rather unusual relationship to the asset introduces several behavioral biases which cause economic frictions not accounted for by traditional neoclassical economic theory (Nicolaides 1988). A notable example of behavioral bias impacting real estate price dynamics is the endowment effect. This behavioral finding was originally established by Kahneman, Knetsch, and Thaler (1990) in the late 20th century, and later applied to real estate markets by Bao and Gong (2016). The latter state that the predictably irrational tendency of market participants to overvalue their home due to sentimental attachment to the property forces market prices into sustained economic disequilibrium. Other highly cited and unusual characteristics are that real estate assets are very infrequently traded due to high transaction costs (Collett, Lizieri, and Ward 2003; Guilkey, Miles, and Cole 1989), and that governments tend to interfere, both directly and indirectly, with real estate markets through fiscal and monetary policy (Bingyang, Jie, and Yinhan 2013; Du, Ma, and An 2011) and through renter-protection laws such as ‘squatter’s rights’ laws, which allow a renter to remain in a home for extended periods long after they have stopped paying rent (Hoy and Jimenez 1991; Gardiner 1997).
The idiosyncratic asset features outlined in section 2.1, along with a high level of heterogeneity across many dimensions of the real estate asset class, make the creation of a generalized pricing model difficult and have led to a wide range of proposals and recommendations about what determines the market price of real estate assets and how to reliably model those pricing dynamics (Curcuru et al. 2010). Pagourtzi et al. (2003) outline several of the currently accepted real estate valuation methods, ranging from what they categorize as the traditional methods, such as the comparable-group, cost, income-multiple, profit-multiple, and contractor’s methods, to the advanced methods, such as artificial neural networks (ANNs), spatial analysis methods, fuzzy logic, and the hedonic pricing method. According to a meta-analysis conducted by Sirmans et al. (2006), the most widely used and accepted advanced methodological framework for real estate valuation modeling is currently the Hedonic Pricing Method.
First applied to automobile data in 1939, according to Goodman (1978), the HPM is a model which estimates the value of the distinct characteristics of a commodity that directly or indirectly contribute to its market value. Besides its implementation in real estate finance and economics, as in this thesis, this methodology has a wide range of applications, including consumer and market research (Holbrook and Hirschman 1982; Arnold and Reynolds 2003), construction of consumer price indices (Moulton 1996; Schultze 2003), various tax assessments (Berry and Bednarz 1975; Bernasconi, Corazzini, and Seri 2014), automated automobile valuation (Cowling and Cubbin 1972; Matas and Raymond 2009), and computer sales (Dulberger 1987; Wakefield and Whitten 2006).
Since its introduction, the HPM has gained significant popularity among housing market and commercial real estate researchers. Specific real estate applications include, but are not limited to, the construction of housing price indices (Gouriéroux and Laferrère 2009; Wallace and Meese 1997), the estimation and prediction of a property’s market value in situations where market-transaction data is low-dimensional or non-existent (LeSage and Pace 2004), and, as in this thesis, the analysis of changes in the demand for specific property characteristics across time, subgroups, or both (Clapp and Giaccotto 1998). As the broad search for a satisfactory modeling framework focuses in on the HPM, another debate arises regarding the best functional form of this method. While traditionally relying on the standard OLS framework (Pace and Gilley 1998), researchers are increasingly utilizing a variety of machine learning algorithms to produce an increasingly refined set of findings.
Unsurprisingly, regression analysis is the preferred estimation approach among real estate researchers when using the HPM for price estimation. These multiple regression methods are most often either an Ordinary Least Squares (OLS) regression or a Maximum Likelihood estimation of the log-likelihood function derived directly from the hedonic function. The two estimation methods take a functionally similar path, as both estimate a vector of parameters (i.e., beta coefficients) that best fits the explanatory hedonic variables to the associated market price. They differ only in the loss function used to identify that best-fitted parameter vector.
The most commonly used hedonic price regression equation with respect to real estate markets models the relationship between market rents or market property values and a list of hedonic characteristics. The classical construction of this model, according to Herath and Maier (2010), is the following:
\[ R = f(P,N,L,t) \] where \(R\) is rent or price of the property; \(P\) is property related attributes; \(N\) is neighborhood characteristics; \(L\) is locational variables and; \(t\) is an indicator of time.
Though first introduced by Turing (1950) under the broader umbrella term of artificial intelligence, the adoption of ML methods in real estate would take many years of software and hardware development, allowing for the subsequent collection of ever-larger data sets and for central processing units (CPUs) capable of handling the often extraordinary number of calculations required to produce a solution for a given algorithm (Dutta 2018). The primary advantage of ML techniques is that ML algorithms learn and improve over time, across many iterations and variable combinations, while traditional statistical and econometric techniques produce static results from a single model (Anguita et al. 2010).
Mohd et al. (2020) provides a thorough overview of the various applications of ML to real estate valuation methods, including the Ridge and Lasso regression techniques used in this thesis.
Work in Progress: Add more volume here. Possible addition of models?
In the wake of the COVID-19 crisis, several papers and articles regarding the economic impact of the global pandemic on the housing market were expeditiously published in virtually every major journal. In this section (2.3), I have selected the most relevant of these publications with respect to this thesis:
- Structural and temporal changes in the housing market using hedonic methods (Shimizu et al. 2010)
- Changes in housing market demand for specific property types and features (Tajani et al. 2021)
- Potential changes in housing preferences due to the COVID-19 pandemic and the challenges these pose for policy making (Nanda et al. 2021)
The utilization of Big Data collected through a data-mining process called web scraping has increasingly become the method of choice for researchers across disciplines. The term web scraping simply refers to the process of collecting structured data from websites using algorithms to automate the collection process. Methods similar to the ones implemented in this thesis have been used by established authors such as Borde et al., Pérez-Rave et al., and Berawi et al.
In this thesis, I have used a mixture of the programming languages R and Python, supplemented by the Selenium browser-automation framework, to write an algorithm that collects the required hedonic variables for this research from the Multiple Listing Service (MLS). Table 1 is a summary of the original data set’s key features.
**Table 1: Variable List** (*Structure and short description*)

| Name | Information |
|---|---|
| Date Range | 23.10.2010 - 12.12.2021 |
| Location | Louisiana, USA |
| Number of Variables | 49 |
| Number of Observations | 31,280 |
| Pre-Corona Obs | 6,256 |
| Post-Corona Obs | 25,024 |
| Variable Type | Variables | Observations |
|---|---|---|
| Continuous | 11 | 31,184 |
| Factor | 38 | 31,280 |
| Nominal Total | 49 | 31,280 |
| Factor-Expanded Total | 114 | 31,280 |
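While the full Selenium pipeline depends on the live MLS site, the parsing step that turns scraped markup into structured records can be illustrated with a small self-contained sketch. The HTML fragment and the `parse_listing` helper below are invented for illustration; only the field names mirror the variables in Table 1.

```python
import re

# A hypothetical MLS listing fragment; real pages differ, but the
# extraction logic is the same: locate labeled fields and coerce types.
listing_html = """
<div class="listing">
  <span class="field" data-name="list_price">245,000</span>
  <span class="field" data-name="area_living">1,850</span>
  <span class="field" data-name="beds_total">3</span>
</div>
"""

def parse_listing(html):
    # Collect every labeled field into a {name: value} dictionary.
    fields = dict(re.findall(r'data-name="([^"]+)">([^<]+)</span>', html))
    # Strip thousands separators and convert numeric fields.
    return {k: float(v.replace(",", "")) for k, v in fields.items()}

record = parse_listing(listing_html)
```

In the actual collection step, a browser driven by Selenium would supply the page source that this parser consumes.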
Though the data-collecting algorithms return structured data, the output is nevertheless far from suitable for the rather demanding models which will eventually analyze it. Therefore, the following processes were completed in order to render the raw data into a usable form:
\[I\left(x\right)=\left\{\begin{array}{ll}1,\quad x\in A\\0, \quad x \notin A\end{array}\right.,\] where \(I\) is an indicator function over the set \(A\) that maps the underlying value \(x\) to a dummy equal to \(1\) if the condition is met and \(0\) if it is not.
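The expansion of factor variables into dummy columns via this indicator function (the "Factor-Expanded Total" counted in the tables above) can be sketched in a few lines. The `expand_factor` helper and its `is_` column naming are illustrative, not the thesis' actual implementation.

```python
def expand_factor(values, levels=None):
    # Expand one categorical column into dummy columns, applying
    # I(x) = 1 if the observation equals the level, else 0.
    levels = levels or sorted(set(values))
    return {
        f"is_{level}": [1 if v == level else 0 for v in values]
        for level in levels
    }

# e.g. a roof_type factor with three observed levels
dummies = expand_factor(["metal", "shingle", "metal", "tile"])
```

Each factor with `m` levels contributes `m` dummy columns here; dropping one level as the reference category is the usual convention when these dummies enter an OLS model with an intercept.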
The results of the data cleaning processes can be seen in Table 2.
**Table 2: Variable List** (*Structure and short description, raw vs. clean*)

| Name | Raw | Clean |
|---|---|---|
| Date Range | 23.10.2010 - 12.12.2021 | 23.10.2010 - 12.12.2021 |
| Location | Louisiana, USA | Louisiana, USA |
| Number of Variables | 49 | 49 |
| Number of Observations | 31,280 | 24,412 |
| Pre-Corona Obs | 6,256 | 4,882 |
| Post-Corona Obs | 25,024 | 19,529 |
| Variable Type | Variables | Observations |
|---|---|---|
| Continuous | 11 | 24,412 |
| Factor | 38 | 24,412 |
| Nominal Total | 49 | 24,412 |
| Factor-Expanded Total | 103 | 24,412 |
**Variable List** (*Structure and short description*)

| Count | Name | Structure | Description |
|---|---|---|---|
| 1 | list_price | Number | Original listing price |
| 2 | photo_count | Number | Number of photos on listing |
| 3 | area_living | Number | Total living area in sqft. |
| 4 | land_acres | Number | Size of land in acres |
| 5 | area_total | Number | Total area in sqft. |
| 6 | age | Number | Age of property |
| 7 | dom | Number | Days on the market |
| 8 | sold_price | Number | Actual sold price |
| 9 | infections_daily | Number | Daily public corona infections |
| 10 | infections_accum | Number | Accumulation of public corona infections |
| 11 | infections_3mma | Number | 3-month moving average of daily public corona infections |
| 12 | sold_date | Date | Date on which the property was sold |
| 13 | beds_total | Factor | Total number of beds |
| 14 | bath_full | Factor | Total number of full bathrooms |
| 15 | bath_half | Factor | Total number of half bathrooms |
| 16 | property_type | Factor | Property type |
| 17 | property_condition | Factor | Property condition |
| 18 | property_style | Factor | Property Style |
| 19 | roof_type | Factor | Roof type |
| 20 | patio | Factor | Patio present |
| 21 | out_building | Factor | Detached building (e.g. shed) present |
| 22 | city_limits | Factor | Property within city limits |
| 23 | mls_number | Factor | MLS number (unique ID) |
| 24 | ac_type | Factor | Air conditioning type |
| 25 | school_general | Factor | School in city limits |
| 26 | pool | Factor | Pool present |
| 27 | gas_type | Factor | Gas type |
| 28 | appliances | Factor | Appliances included |
| 29 | garage | Factor | Garage present |
| 30 | energy_efficient | Factor | Energy-efficient features present |
| 31 | exterior_type | Factor | Exterior type |
| 32 | exterior_features | Factor | Exterior features |
| 33 | fireplace | Factor | Fireplace present |
| 34 | foundation_type | Factor | Foundation type (e.g. slab) |
| 35 | sewer_type | Factor | Sewer type |
| 36 | subdivision | Factor | Property within subdivision |
| 37 | water_type | Factor | Water supply type |
| 38 | waterfront | Factor | Property has waterfront |
| 39 | corona_date_split | Factor | Date of first mandatory lockdowns in Louisiana (i.e. 23.03.2020) |
| 40 | top25_sold_price | Factor | Top 25th percentile of sold price |
| 41 | top50_sold_price | Factor | Top 50th percentile of sold price |
| 42 | bottom25_sold_price | Factor | Bottom 25th percentile of sold price |
| 43 | top25_area_living | Factor | Top 25th percentile of total living area |
| 44 | bottom25_area_living | Factor | Bottom 25th percentile of living area |
| 45 | top25_age | Factor | Top 25th percentile of total age |
| 46 | bottom25_age | Factor | Bottom 25th percentile of age |
| 47 | top25_dom | Factor | Top 25th percentile of days on market |
| 48 | bottom25_dom | Factor | Bottom 25th percentile of days on market |
| 49 | infections_period | Factor | Period after accumulated infections > 1000 cases |
The correlation matrix between all numeric variables shows that, with the exception of variables with obvious correlations (e.g., sold_price and list_price, area_total and area_living, and the infection figures), there are no correlations which would cause concern.
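This correlation screening can be reproduced with a small numpy sketch; the simulated variables below are stand-ins for the data set's numeric columns, with area_total deliberately constructed as a near-duplicate of area_living to mimic the pairs flagged above.

```python
import numpy as np

rng = np.random.default_rng(42)
area_living = rng.normal(1800, 400, 500)
# area_total is living area plus extra (garage, porch) space, so the two
# are almost perfectly correlated by construction.
area_total = area_living + rng.normal(300, 50, 500)
age = rng.uniform(0, 80, 500)

# Rows of the input list become variables; corr is the 3x3 matrix.
corr = np.corrcoef([area_living, area_total, age])
```

In practice one scans the off-diagonal entries for values near 1 or -1 that are not explained by a variable being a mechanical transform of another.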
Considering the standard descriptive characteristics of each variable (i.e., measures of frequency, central tendency, dispersion, and position), the matrix of density plots below gives a good overview of the most relevant variables in this data set.
Under Construction
The overarching method used in this thesis is the Hedonic Pricing Method (HPM), also often referred to as hedonic regression or hedonic demand theory. The fundamental theory behind the HPM is the following: commodities are distinguishable by their component parts, and therefore the market value of a given commodity can be calculated by summing the estimated values of its separate characteristics. For this theory to hold true, several critical requirements must be met. Primarily, the commodity being valued must be reducible to its component parts, and the market must be able to implicitly and independently value these characteristics. The fulfillment of these requirements is not obvious and in reality will in some measure fall short of accounting for the complete nature of price dynamics in practically every asset class. However, this limitation offers an interesting problem to test: namely, to find the limit of the accumulated power of these component parts to account for market values and their deviations across time and subgroups. This exact question will later be examined by implementing a machine-learned predictive model to measure the theoretical maximum explanatory power of the included hedonic variables. In the following two subsections, we review the methods used in this paper to econometrically model the HPM on hedonic real estate data.
In this section, I will outline the construction of my base OLS model, termed the Alpha model, as well as the treatment process for heteroskedasticity, multicollinearity, non-linearity, and high-leverage points and outliers.
Following the OLS construction laid out by Herath and Maier (2010):
\[ R = f(P,N,L,t) \] where \(R\) is rent or price of the property; \(P\) is property related attributes; \(N\) is neighborhood characteristics; \(L\) is locational variables and; \(t\) is an indicator of time.
This paper’s base OLS model, named the Alpha model, is as follows:
\[ {Sold \ Price}_{i} = \alpha \ + \ \sum_{j=1}^{k} \beta_j x_{ij} \ + \ \varepsilon_{i} \]
or, in matrix notation,
\[ \mathbf{P}_{n\times1} = \alpha\boldsymbol{\iota}_{n\times1} \ + \ \mathbf{X}_{n\times k}\boldsymbol{\beta}_{k\times1} \ + \ \boldsymbol{\varepsilon}_{n\times1} \]
where \(\mathbf{P}\) is the vector of sold prices, \(\alpha\) the intercept, \(\mathbf{X}\) the matrix of hedonic regressors, \(\boldsymbol{\beta}\) the vector of coefficients, and \(\boldsymbol{\varepsilon}\) the error term.
A Breusch-Pagan test was conducted on a standard linear regression model with sold price as the dependent variable and the rest of the data set as regressors. The Breusch-Pagan (BP) test was established in 1979 and follows the logic of the Lagrange multiplier test principle (Breusch and Pagan 1979). It tests the null hypothesis that the variance of the model’s errors is independent of the model’s regressors (i.e., homoskedasticity). The test results in a rejection of the null hypothesis, thereby finding the base model to be heteroskedastic. The results of this test are summarized in table x.
**Breusch-Pagan Test for Heteroskedasticity**

| Hypotheses | Test Summary | Value |
|---|---|---|
| Ho: the variance is constant | DF | 1 |
| Ha: the variance is not constant | Chi2 | 850.4231 |
| | Prob > Chi2 | 0.00 |
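The BP statistic reported above can be reproduced from first principles: regress the squared OLS residuals on the regressors and take n·R² as the LM statistic, which is chi-squared with degrees of freedom equal to the number of slope regressors under the null. The sketch below uses simulated data rather than the thesis data set, purely for illustration.

```python
import numpy as np

def breusch_pagan_lm(X, y):
    # OLS fit; X must include an intercept column.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    u2 = (y - X @ beta) ** 2
    # Auxiliary regression: squared residuals on the same regressors.
    gamma, *_ = np.linalg.lstsq(X, u2, rcond=None)
    resid_aux = u2 - X @ gamma
    r2 = 1 - resid_aux.var() / u2.var()
    return len(y) * r2  # LM statistic, ~ chi2(k) under H0

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
# Error standard deviation grows with x: heteroskedastic by design.
y_het = 1 + 2 * x + rng.normal(0, x, n)
y_hom = 1 + 2 * x + rng.normal(0, 1, n)
lm_het = breusch_pagan_lm(X, y_het)
lm_hom = breusch_pagan_lm(X, y_hom)
```

With one slope regressor, an LM value far beyond the chi-squared(1) critical value (3.84 at the 5% level) rejects homoskedasticity, exactly the pattern the table above shows for the Alpha model.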
To resolve the heteroskedasticity found in §3.4.1, I will produce heteroskedasticity-consistent (HC) standard errors, also known as heteroskedasticity-robust standard errors, through the refined method established by the econometrician Halbert Lynn White (White 1980). This process is as follows:
If the model’s errors \(u_{i}\) are independent, but have distinct variances \(\sigma _{i}^{2}\) then \(\Sigma =\operatorname{diag}(\sigma _{1}^{2},\ldots ,\sigma _{n}^{2})\) which can be estimated with \({\displaystyle {\widehat {\sigma }}_{i}^{2}={\widehat {u}}_{i}^{2}}\). This relationship produces the estimator found in White (1980):
\[ {\displaystyle {\begin{aligned}v_{\text{HCE}}\left[{\widehat {\beta }}_{\text{OLS}}\right]&={\frac {1}{n}}\left({\frac {1}{n}}\sum _{i}X_{i}X_{i}'\right)^{-1}\left({\frac {1}{n}}\sum _{i}X_{i}X_{i}'{\widehat {u}}_{i}^{2}\right)\left({\frac {1}{n}}\sum _{i}X_{i}X_{i}'\right)^{-1}\end{aligned}}}\] \[{\displaystyle {\begin{aligned}&=(\mathbb {X} '\mathbb {X} )^{-1}(\mathbb {X} '\operatorname {diag} ({\widehat {u}}_{1}^{2},\ldots ,{\widehat {u}}_{n}^{2})\mathbb {X} )(\mathbb {X} '\mathbb {X} )^{-1},\end{aligned}}}\]
where \(\mathbb{X}\) denotes the matrix of stacked \(X_i'\) values from the data. The estimator can be derived in terms of the generalized method of moments (GMM).
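White's sandwich estimator above translates almost line-for-line into code. The following numpy sketch computes the HC0 variant on simulated heteroskedastic data; the data-generating process is invented for illustration.

```python
import numpy as np

def hc0_standard_errors(X, y):
    # OLS coefficients and residuals.
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta
    # "Sandwich": (X'X)^-1 (X' diag(u_i^2) X) (X'X)^-1, as in White (1980).
    meat = X.T @ (X * (u ** 2)[:, None])
    cov = XtX_inv @ meat @ XtX_inv
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(1)
n = 1000
x = rng.uniform(1, 5, n)
X = np.column_stack([np.ones(n), x])
y = 1 + 2 * x + rng.normal(0, x, n)  # heteroskedastic errors
beta, robust_se = hc0_standard_errors(X, y)
```

The point estimates are unchanged from plain OLS; only the standard errors (and hence the significance tests) are corrected.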
For the remainder of this thesis, all statements and figures regarding statistical significance refer to tests conducted with heteroskedasticity-consistent (HC) standard errors. With sample errors of equal variance and no correlation, the least-squares estimates of each model’s beta coefficients are regarded as Best Linear Unbiased Estimators (BLUEs).
Multicollinearity is measured using Variance Inflation Factors (VIF). The VIF of a predictor measures how accurately that variable can be predicted using all other variables. For context, the square root of a VIF represents the increase in the standard error of the estimated coefficient relative to the case in which that variable is independent of all other variables. In line with current convention, all variables with a VIF larger than 5 are eliminated. A graphical representation of each variable’s multicollinearity, measured by VIF, is shown in image x.
This test resulted in the elimination of the variables list_price and area_total, as they were highly multicollinear with sold_price and area_living, respectively.
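The VIF screening can be sketched directly from its definition, VIF_j = 1/(1 - R²_j), where R²_j comes from regressing predictor j on all the others. The collinear pair below mimics the sold_price/list_price situation; the data are simulated for illustration.

```python
import numpy as np

def vif(X):
    # X: columns are predictors (no intercept column). For each column,
    # regress it on the remaining columns and compute 1 / (1 - R^2).
    n, k = X.shape
    out = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(2)
sold_price = rng.normal(250_000, 60_000, 500)
list_price = sold_price + rng.normal(0, 5_000, 500)  # near-duplicate
age = rng.uniform(0, 80, 500)
vifs = vif(np.column_stack([sold_price, list_price, age]))
```

The near-duplicate pair produces VIFs far above the conventional cutoff of 5, while the independent age variable sits near 1, matching the elimination logic described above.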
A visual analysis was conducted on all continuous variables, and non-linear variable transformations were added for age and area_living.
An analysis of age vs. sold price shows a well-established U-shaped pattern. In order to allow the OLS model to better capture this relationship, a new variable \(age^2\) is added to the model.
Image x shows the Partial Dependence Plot (PDP) of age within the Alpha model, i.e., the marginal prediction of the Alpha model across the full range of age. When the scale of the y-axis is reduced, we see the slight curvature in the Alpha model’s estimation of age effects. This addition improved \(R^2\) by \(0.08\).
An analysis of living area vs. sold price reveals a sigmoid pattern between the two variables. In order to allow the OLS model to better capture this relationship, a new variable \(living \ area^2\) is added to the model.
Image x shows the Partial Dependence Plot (PDP) of \(living \ area^2\) within the Alpha model. This addition improved \(R^2\) by \(0.06\).
After the adjustments of the previous sections have been made, a panel of visualizations is run on the Alpha model. The results show no extreme outliers and only one high-leverage point (obs #23515), as shown by the residual vs. leverage plot in quadrant two of image X, which is removed in the final Alpha model. These results are mainly due to the previous removal of outliers in §3.2 and the overall quality of the data set.
Under Construction
The Alpha model is the baseline OLS model for this thesis and is robust to heteroskedasticity, multicollinearity, non-linearities, and high-leverage points and outliers. These high-level adjustments increase our confidence in the statistical test results which follow.
The focus of this thesis is how the Covid crisis impacted housing prices and the relative levels of demand for specific hedonic features. When the HPM is in OLS functional form, the beta coefficients of the model represent the relative demand for each associated hedonic feature (Shimizu et al. 2010). For example, \(\beta_{pool = 1} = 11{,}856\) is interpreted as the average consumer’s willingness to pay for an average pool, ceteris paribus. However, this thesis wishes to measure the changes in the average consumer’s willingness to pay for a given feature (e.g., a pool) in relation to a measurement of Covid’s economic impact.
This requires a method of statistically comparing changes in the beta coefficients of particular features of interest. To accomplish this, a method outlined by the UCLA Statistics department is implemented (Bruin 2011). The best way to understand this method is to work through a simple reproducible example.
Assume we want to test the economic impact of Corona on the relative demand, \(\beta_{pool = 1}\), for swimming pools, \(pool\). Using the UCLA method, we test the null hypothesis \(H_0: \beta_{pool = 1,\ corona = 0} = \beta_{pool = 1,\ corona = 1}\) with the following OLS:
\[
{sold \ price} = \alpha \ + \beta_1 {pool} \ + \ \beta_2 {corona} \ + \beta_3 {(pool \times corona)} \ + \ \varepsilon
\]
Which results in the following:
The interpretation of this simplified model’s estimates is:
Therefore, we say:
The average premium for a property having a swimming pool fell by $7,766.40 compared to pre-corona levels, ceteris paribus. However, this finding is only significant at the p < 0.10 level.
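The interaction approach can be verified on simulated data. In the sketch below, the true pre-corona pool premium is set to $12,000 and the post-corona premium to $4,000 (both numbers invented for the sketch), so the interaction coefficient \(\beta_3\) should recover roughly -8,000.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
pool = rng.integers(0, 2, n)
corona = rng.integers(0, 2, n)
# Simulated pool premium: $12,000 pre-corona, $4,000 post-corona.
premium = np.where(corona == 1, 4_000, 12_000)
price = 200_000 + premium * pool + rng.normal(0, 20_000, n)

# OLS with the pool x corona interaction term.
X = np.column_stack([np.ones(n), pool, corona, pool * corona])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
# beta[1]: pre-corona pool premium; beta[3]: post-corona change in it.
```

In the thesis models, the same coefficient (with an HC-robust standard error) carries the hypothesis test on the change in willingness to pay.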
A good measurement of the market’s response to the corona crisis must be a corona-related metric which is publicly available, common knowledge to the population, and a reasonable predictor of future economic shifts. For this reason, data collected from the Louisiana Department of Health (LaDH 2022) was used to calculate the 3-month moving average of corona infections (infections_3mma).
This measurement fulfills the previously stated requirements of a good test measurement: it is purely related to the corona crisis, is publicly available, is assumed to be publicly known as it is reported across all major news stations daily, and, perhaps most importantly, is the primary metric used to decide when mandatory lockdowns are instituted. For this thesis, it is assumed that the market responds to some lagged value of daily infections, which consumers use to estimate the likelihood of future lockdowns and the stringency and duration of current lockdowns. With this rationale, infections_3mma is selected as the primary measurement of corona’s impact.
In the previous subsection 4.1, it was stated that the multivariable regression models estimate a vector of parameters (i.e., beta coefficients) that best fits the explanatory hedonic variables to the associated dependent variable. Intuitively, the resulting coefficient vector is fitted to the entire data set, and therefore the loss function minimizes the error in the model’s ability to explain the very dependent variable it was fitted to. In other words, these results are ultimately limited to their inferential value within the exact context of the data set the model was trained on. If one wishes to establish a wider, more general relationship between dependent and independent variables that goes beyond the context of the training data set, supervised machine learning (ML) prediction models are an extremely powerful tool for doing so. Though the models used in this thesis differ across several key processes, they each generally follow a similar logic:
Data Splitting and CV:
Data must first be split into ‘train’ and ‘test’ (validation) subsets. Each ML model is fitted to the train data set, and its performance is evaluated based on the model’s ability to predict out-of-sample observations in the test (validation) data set. In this way, these models are ranked based on their test mean squared errors (MSE). This process is often referred to as Cross-Validation (CV). The two most commonly used CV methods are the Validation Set Approach and K-Fold Cross-Validation.
The Validation Set Approach (VSA) is the simplest case of cross-validation data splitting, as it simply randomly splits the entire data set into train and test subsets based on a chosen percentage split. For example, the researcher can choose an 80% training and 20% testing split. The estimate of the test MSE is simply the model’s error against the test data set.
\[CV_{vsa} = MSE_{vsa}\]
The K-Fold CV method has increasingly been used by researchers as it offers a more comprehensive cross-validation process compared to other methods. K-Fold CV is the process of randomly splitting the entire data set into k groups, or folds, each with approximately an equal number of observations. The first fold is held out and the model is trained on the remaining k-1 folds. This process is repeated k times, each time holding out a different fold, until every fold has been treated as the validation set. Finally, this results in k estimates of the model’s test error, and the final estimate is the average across all k model fits.
\[CV_{k\text{-}fold} = \frac{1}{k}\sum_{i=1}^{k} MSE_i\]
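The k-fold procedure just described reduces to a few lines of code. The sketch below implements it for a simple one-regressor OLS on simulated data; the data-generating process and `kfold_mse` helper are illustrative only.

```python
import numpy as np

def kfold_mse(X, y, k=5, seed=0):
    # Shuffle indices, then split into k roughly equal folds.
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit OLS on the k-1 training folds...
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # ...and score on the held-out fold.
        mses.append(((y[test] - X[test] @ beta) ** 2).mean())
    return np.mean(mses)  # average test MSE across the k folds

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 500)
X = np.column_stack([np.ones(500), x])
y = 3 + 1.5 * x + rng.normal(0, 2, 500)
cv = kfold_mse(X, y)
```

Since the simulated noise has variance 4, the k-fold estimate should land near that value; the Validation Set Approach is the special case of a single train/test split.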
Now that a method of model fitting and evaluation has been established through cross-validation, we have a way to rank different models against each other based on their ability to estimate the true test MSE. With this extension, it is possible to test different combinations of the available independent variables to search for the most accurate and least over-fitted model possible.
Under Construction
The XGBoost model chosen and tuned for this thesis is trained with the following algorithm:
Algorithm input: training set \({\displaystyle \{(x_{i},y_{i})\}_{i=1}^{N}}\), a differentiable loss function \(L(y, F(x))\), a number of weak learners \(M\), and a learning rate \(\alpha\).
Algorithm:
2.1 Initialize model with a constant value:
\[{\displaystyle {\hat {f}}_{(0)}(x)={\underset {\theta }{\arg \min }}\sum _{i=1}^{N}L(y_{i},\theta )}\]
2.2 For \(m = 1\) to \(M\):
2.2.1 Compute the ‘gradients’ and ‘Hessians’: \[{\displaystyle {\hat {g}}_{m}(x_{i})=\left[{\frac {\partial L(y_{i},f(x_{i}))}{\partial f(x_{i})}}\right]_{f(x)={\hat {f}}_{(m-1)}(x)}.}\] \[{\displaystyle {\hat {h}}_{m}(x_{i})=\left[{\frac {\partial ^{2}L(y_{i},f(x_{i}))}{\partial f(x_{i})^{2}}}\right]_{f(x)={\hat {f}}_{(m-1)}(x)}.}\]
2.2.2 Fit a base learner (or weak learner, e.g. tree) using the training set \({\displaystyle \displaystyle \{x_{i},-{\frac {{\hat {g}}_{m}(x_{i})}{{\hat {h}}_{m}(x_{i})}}\}_{i=1}^{N}}\) by solving the optimization problem:
\[{\displaystyle {\hat {\phi }}_{m}={\underset {\phi \in \mathbf {\Phi } }{\arg \min }}\sum _{i=1}^{N}{\frac {1}{2}}{\hat {h}}_{m}(x_{i})\left[-{\frac {{\hat {g}}_{m}(x_{i})}{{\hat {h}}_{m}(x_{i})}}-\phi (x_{i})\right]^{2}.}\]
\[{\displaystyle {\hat {f}}_{m}(x)=\alpha {\hat {\phi }}_{m}(x).}\]
2.3 Update the model: \[{\displaystyle {\hat {f}}_{(m)}(x)={\hat {f}}_{(m-1)}(x)+{\hat {f}}_{m}(x).}\]
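The loop above can be made concrete with a deliberately tiny numpy implementation for squared-error loss, where the gradient is the negative residual and the Hessian is constant at 1, so the targets in step 2.2.2 reduce to plain residuals. This is a pedagogical stand-in for the tuned XGBoost model, not its actual configuration.

```python
import numpy as np

def fit_stump(x, z):
    # Weak learner (step 2.2.2): a one-split decision stump fitted to
    # the pseudo-targets z by minimizing squared error.
    best = (np.inf, None)
    for t in np.unique(x):
        left, right = z[x <= t], z[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((z - pred) ** 2).sum()
        if sse < best[0]:
            best = (sse, (t, left.mean(), right.mean()))
    t, lv, rv = best[1]
    return lambda q: np.where(q <= t, lv, rv)

def gradient_boost(x, y, M=50, lr=0.1):
    # Step 2.1: initialize with the constant minimizing squared loss.
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    learners = []
    for _ in range(M):
        # Step 2.2.1: for squared loss, gradient = pred - y, Hessian = 1,
        # so -g/h is simply the residual.
        residuals = y - pred
        phi = fit_stump(x, residuals)
        pred += lr * phi(x)  # step 2.3: update the model
        learners.append(phi)
    return lambda q: f0 + lr * sum(phi(q) for phi in learners)

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 200)
y = np.sin(x) + rng.normal(0, 0.1, 200)
model = gradient_boost(x, y)
```

Production XGBoost additionally regularizes tree complexity and uses the Hessians in its split criterion; the sketch keeps only the boosting skeleton of steps 2.1 through 2.3.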
Notes
```r
library(gt)
library(tidyverse)
library(glue)

# Define the start and end dates for the data range
start_date <- "2010-06-07"
end_date <- "2010-06-14"

# Create a gt table based on preprocessed `sp500` table data
sp500 %>%
  dplyr::filter(date >= start_date & date <= end_date) %>%
  dplyr::select(-adj_close) %>%
  gt() %>%
  tab_header(
    title = "S&P 500",
    subtitle = glue::glue("{start_date} to {end_date}")
  ) %>%
  fmt_date(
    columns = vars(date),
    date_style = 3
  ) %>%
  fmt_currency(
    columns = vars(open, high, low, close),
    currency = "USD"
  ) %>%
  fmt_number(
    columns = vars(volume),
    suffixing = TRUE
  )
```
```r
# Tables
table1 <- gt(df3) %>%
  tab_header(title = md("Variable List"),
             subtitle = md("*Structure and short description*"))

table2 <- gt(df3) %>%  # was mistakenly built from table1
  tab_header(title = md("Variable List"),
             subtitle = md("*Structure and short description*"))

# Combine 1 and 2
data_tables <- data.frame(table_1 = table1, table_2 = table2)
data_tables %>%
  gt() %>%
  fmt_markdown(columns = TRUE) %>% # render cell contents as html
  cols_label(table_1.Count = "Table 1",
             table_2.Count = "Table 2")
```
```r
library(scales) # provides col_numeric()

hp_table <- function(x) {
  gt(x) %>%
    data_color(columns = c("hp"),
               colors = col_numeric(palette = "Blues",
                                    domain = c(0, 400))) %>%
    tab_options(column_labels.hidden = TRUE) %>%
    as_raw_html() # return as html
}
```
End of Document